Alfred Job Network GraphΒΆ

This notebook builds on the proposal in FinalProjectProposa_Draft.pdf and uses job data stored in the alfred_db PostgreSQL database. We will extract the latest postings generated by Alfred, engineer job & skill relationships, and analyze the resulting graph structure.

Workflow OverviewΒΆ

There are some companian files in this project that help with runnign this notebook: 'artifacts/Jobs_clean.csv' outputs cleaned job data from the database 'artifacts/jobs_skills_map.csv' exploded job-skill mapping data 'artifacts/jobs_clean.parquet' parquet version of cleaned job data 'artifacts/jobs_skill_map.parquet' parquet version of job-skill mapping data

I am pulling directly from the database in this notebook, but these files can be used to skip that step if needed.

If your are pulling from the database, the workflow is as follows:

  1. Import necessary libraries for data manipulation, database connection, and graph analysis.
  2. Load environment variables (or manually provide credentials) that describe how to reach alfred_db.
  3. Connect to PostgreSQL, pull the jobs table into a pandas DataFrame, and perform light cleaning.
  4. Use spaCy-based NLP to infer skill phrases directly from each job description.
InΒ [1]:
%pip install -q python-dotenv sqlalchemy psycopg2-binary pandas networkx matplotlib plotly-express tqdm spacy scikit-learn
Note: you may need to restart the kernel to use updated packages.
InΒ [2]:
# Core analysis imports plus visualization + progress helpers used throughout the notebook.
import os
from pathlib import Path
from typing import List, Tuple

import pandas as pd
from sqlalchemy import create_engine, text
from sqlalchemy.engine import Engine
from dotenv import load_dotenv
import networkx as nx
import matplotlib.pyplot as plt
import plotly.express as px
import plotly.graph_objects as go
from tqdm.auto import tqdm
import networkx.algorithms.community as nx_comm
from collections import defaultdict
import re
from typing import List, Tuple
import itertools
from collections import Counter

import spacy
from spacy.cli import download as spacy_download
from sklearn.cluster import MiniBatchKMeans
from sklearn.feature_extraction.text import TfidfVectorizer

pd.set_option("display.max_colwidth", 100)
pd.set_option("display.max_rows", 20)
InΒ [Β ]:
import plotly.io as pio
pio.renderers.default = 'png'  # ensure Plotly renders as static images for exports

Load Environment Variables and Create a Database EngineΒΆ

I am Storing my database credentials in a .env file for security and convenience. But for your covenieence you can start at the section Titles:"FROM CSV FILES" below to skip the database connection step.

InΒ [3]:
PROJECT_ROOT = Path.cwd()
ALFRED_ROOT = (
    PROJECT_ROOT.parents[0] / "alfred"
).resolve()  # Look one level up for the Alfred repo.

# Load whichever .env files exist so DATABASE_URL and friends are available without manual export.
candidate_env_files = [PROJECT_ROOT / ".env", ALFRED_ROOT / ".env"]
for env_path in candidate_env_files:
    if env_path.exists():
        load_dotenv(env_path, override=False)
        print(f"Loaded environment variables from {env_path}")

DATABASE_URL = os.getenv("DATABASE_URL")
if not DATABASE_URL:
    raise RuntimeError(
        "DATABASE_URL is not set. Create a .env file or export the variable before running this notebook."
    )

# Create a reusable SQLAlchemy engine that downstream cells can share.
engine: Engine = create_engine(DATABASE_URL)
engine
Loaded environment variables from d:\Data620\FinalProject\.env
Out[3]:
Engine(postgresql+psycopg2://alfred:***@localhost:5432/alfred_db)

Extract the Latest Jobs Table SnapshotΒΆ

We query the core columns required for downstream analysis. Adjust the SQL to join additional tables once they are available (e.g., embeddings, generated resume artifacts, or recruiter contact data).

InΒ [4]:
def fetch_jobs(engine: Engine) -> pd.DataFrame:
    """Return the latest Alfred job postings needed for the network analysis."""
    query = text(
        """
        SELECT
            id,
            title,
            company,
            location,
            description,
            source_url,
            match_score
            
        FROM jobs
        ORDER BY id DESC;
        """
    )  # Adjust the SQL when you want to join extra tables/columns.
    return pd.read_sql_query(query, engine)


# Pull the raw dataset once so downstream wrangling operates on an in-memory DataFrame.
jobs_raw = fetch_jobs(engine)
print(f"Retrieved {len(jobs_raw):,} job postings from Alfred.")
jobs_raw.head()
Retrieved 1,208 job postings from Alfred.
Out[4]:
id title company location description source_url match_score
0 1370 Data Scientist Insight Global South Camden, Camden County Job Description The Cooper Health System is seeking a Data Scientist to join its analytics team.... https://www.adzuna.com/land/ad/5513717171?se=mvDVO3HX8BGRNdJTGBOSjQ&utm_medium=api&utm_source=1f... NaN
1 1369 Data Scientist Home Depot Atlanta, Fulton County Position Purpose: The Data Scientist is responsible for supporting data science initiatives that... https://www.adzuna.com/land/ad/5522795794?se=mvDVO3HX8BGRNdJTGBOSjQ&utm_medium=api&utm_source=1f... NaN
2 1368 Data Scientist Stanford University Stanford, Santa Clara County Data Scientist Graduate School of Education, Stanford, California, United States Information Ana... https://www.adzuna.com/land/ad/5454326822?se=mvDVO3HX8BGRNdJTGBOSjQ&utm_medium=api&utm_source=1f... NaN
3 1367 Data Scientist Manulife Boston, Suffolk County At Manulife, data scientists collaborate with engineers, IT, and experts to build analytical mod... https://www.adzuna.com/land/ad/5513716776?se=mvDVO3HX8BGRNdJTGBOSjQ&utm_medium=api&utm_source=1f... NaN
4 1366 Data Scientist Generac Power Systems Pewaukee, Waukesha County We are Generac, a leading energy technology company committed to powering a smarter world. Over ... https://www.adzuna.com/land/ad/5528566412?se=mvDVO3HX8BGRNdJTGBOSjQ&utm_medium=api&utm_source=1f... NaN

Clean Text Fields and Create Helper ColumnsΒΆ

We standardize key string columns, drop duplicate source_urls, and prepare a normalized description column for lightweight NLP.

InΒ [5]:
ICON_PATTERN = r"[????????????????????????????????]"

# Basic QA + normalization to reduce noisy punctuation before NLP-heavy steps.
jobs_clean = (
    jobs_raw.copy()
    .dropna(subset=["title", "description"])
    .drop_duplicates(subset=["source_url"], keep="first")
    .assign(
        title=lambda df: df["title"].str.strip(),
        company=lambda df: df["company"].fillna("Unknown").str.strip(),
        location=lambda df: df["location"].fillna("Remote").str.strip(),
        description=lambda df: df["description"]
        .str.replace(
            ICON_PATTERN, " ", regex=True
        )  # strip arrow/bullet icons that survive scraping
        .str.replace(r"\s+", " ", regex=True)  # collapse multi-space/newline sequences
        .str.strip(),
    )
)

# Helper columns reused in later analysis/plots.
jobs_clean["description_lower"] = jobs_clean["description"].str.lower()
jobs_clean["job_label"] = jobs_clean.apply(
    lambda row: f"{row['title']} @ {row['company']}", axis=1
)

print(f"{len(jobs_clean):,} clean job postings remaining after QA.")
jobs_clean.head()
1,208 clean job postings remaining after QA.
Out[5]:
id title company location description source_url match_score description_lower job_label
0 1370 Data Scientist Insight Global South Camden, Camden County Job Description The Cooper Health System is seeking a Data Scientist to join its analytics team.... https://www.adzuna.com/land/ad/5513717171?se=mvDVO3HX8BGRNdJTGBOSjQ&utm_medium=api&utm_source=1f... NaN job description the cooper health system is seeking a data scientist to join its analytics team.... Data Scientist @ Insight Global
1 1369 Data Scientist Home Depot Atlanta, Fulton County Position Purpose: The Data Scientist is responsible for supporting data science initiatives that... https://www.adzuna.com/land/ad/5522795794?se=mvDVO3HX8BGRNdJTGBOSjQ&utm_medium=api&utm_source=1f... NaN position purpose: the data scientist is responsible for supporting data science initiatives that... Data Scientist @ Home Depot
2 1368 Data Scientist Stanford University Stanford, Santa Clara County Data Scientist Graduate School of Education, Stanford, California, United States Information Ana... https://www.adzuna.com/land/ad/5454326822?se=mvDVO3HX8BGRNdJTGBOSjQ&utm_medium=api&utm_source=1f... NaN data scientist graduate school of education, stanford, california, united states information ana... Data Scientist @ Stanford University
3 1367 Data Scientist Manulife Boston, Suffolk County At Manulife, data scientists collaborate with engineers, IT, and experts to build analytical mod... https://www.adzuna.com/land/ad/5513716776?se=mvDVO3HX8BGRNdJTGBOSjQ&utm_medium=api&utm_source=1f... NaN at manulife, data scientists collaborate with engineers, it, and experts to build analytical mod... Data Scientist @ Manulife
4 1366 Data Scientist Generac Power Systems Pewaukee, Waukesha County We are Generac, a leading energy technology company committed to powering a smarter world. Over ... https://www.adzuna.com/land/ad/5528566412?se=mvDVO3HX8BGRNdJTGBOSjQ&utm_medium=api&utm_source=1f... NaN we are generac, a leading energy technology company committed to powering a smarter world. over ... Data Scientist @ Generac Power Systems

** "From CSV Files" ** Start here to skip the database connection step. run the next cell to load the csv files directly.

InΒ [6]:
jobs_clean.to_csv("artifacts/jobs_clean.csv", index=False)
InΒ [7]:
summary = (
    jobs_clean.groupby("title")
    .agg(num_roles=("id", "count"), avg_match_score=("match_score", "mean"))
    .reset_index()
    .sort_values("num_roles", ascending=False)
)

summary.head(20)
Out[7]:
title num_roles avg_match_score
53 Data Engineer 254 0.431432
254 Sr. Azure Data Engineer 100 0.449779
265 Staff Data Engineer 65 0.414906
23 Business Analyst 47 NaN
37 Data Analyst 31 0.519519
184 Senior Data Engineer 25 0.412558
61 Data Engineer / Senior Data Engineer (GCP, BigQuery) 25 0.452121
60 Data Engineer / Senior Data Engineer (AI/ML) 25 0.432949
170 Principal Software Engineer 17 NaN
181 Senior Data Analyst 17 0.506788
34 Consultant, Data Engineering 16 0.465818
168 Principal Data Engineer 16 0.414667
206 Senior Software Engineer 16 NaN
264 Sr. Software Engineer 15 NaN
13 AWS Data Engineer 15 0.471633
57 Data Engineer - Manager 14 0.422792
157 Lead Software Engineer 14 NaN
102 GCP Data Engineer 14 0.394899
252 Sr Data Engineer 14 0.461888
124 IT Support Engineer 12 NaN

Dictionary-Based Skill Extraction (Baseline)ΒΆ

We first apply the curated keyword dictionary to tag skills, preserving the original deterministic baseline. These results are saved separately so we can compare coverage against the NLP-driven extraction that follows.

InΒ [8]:
import re
from typing import List, Tuple

# --- 1. CLEANED AND CORRECTED SKILL DICTIONARY ---
SKILLS_DICTIONARY = {
    "Programming_Tools": [
        "Python",
        "R",
        "SQL",
        "Scala",
        "Java",
        "Julia",
        "Bash",
        "Jupyter",
        "Git",
        "GitHub",
        "Pandas",
        "NumPy",
        "SciPy",
        "Matplotlib",
        "Seaborn",
        "Plotly",
    ],
   
    "MLOps_DevOps_Cloud": [
        "MLOps",
        "Docker",
        "Kubernetes",
        "Airflow",
        "MLflow",
        "Kubeflow",
        "DVC",
        "AWS",
        "SageMaker",
        "Azure ML",
        "GCP",
        "BigQuery",
        "S3",
        "Lambda",
        "Databricks",
        "Snowflake",
        "Spark",
        "PySpark",
        "Hadoop",
        "Kafka",
    ],
    "Machine_Learning_and_Artificial_Intelligence": [
        "Machine Learning",
        "Deep Learning",
        "Neural Network",
        "NLP",
        "Natural Language Processing",
        "Computer Vision",
        "CV",
        "Reinforcement Learning",
        "Supervised Learning",
        "Unsupervised Learning",
        "Semi-Supervised Learning",
        "Transfer Learning",
        "Time Series",
        "Anomaly Detection",
        "Clustering",
        "Classification",
        "Regression",
        "Cross Validation",
        "Dimensionality Reduction",
        "PCA",
        "t-SNE",
        "UMAP",
         "scikit-learn",
        "sklearn",
        "statsmodels",
        "PyTorch",
        "TensorFlow",
        "Keras",
        "XGBoost",
        "LightGBM",
        "CatBoost",
        "Prophet",
        "Sktime",
    ],
    "Statistics": [
        "Statistics",
        "Statistical Modeling",
        "Hypothesis Testing",
        "Bayesian Statistics",
        "Confidence Intervals",
        "P-Values",
        "ANOVA",
        "Regression Analysis",
        "Chisquared Test",
        "Distribution Fitting",
        "Data Exploration",
        "Summary Statistics",
        "Descriptive Statistics",
    ],
    "Finance_Fintech_Banking": [
        "Financial Markets",
        "Asset Management",
        "Quantitative Finance",
        "Quant",
        "Credit Risk",
        "Market Risk",
        "Operational Risk",
        "Underwriting",
        "Fraud Detection",
        "AML",
        "Anti-Money Laundering",
        "KYC",
        "Know Your Customer",
        "Basel III",
        "MiFID",
        "SEC",
        "Trading Algorithm",
        "Algorithmic Trading",
        "Fixed Income",
        "Derivatives",
        "Hedge Fund",
        "Valuation",
        "Credit Card",
    ],
    "Healthcare_Pharma_Biotech": [
        "Healthtech",
        "Biomedical",
        "Clinical Trial",
        "Drug Discovery",
        "EHR",
        "Electronic Health Record",
        "Epic",
        "Cerner",
        "FHIR",
        "HIPAA",
        "PHI",
        "Protected Health Information",
        "CMS",
        "Claims Data",
        "ICD-10",
        "CPT Codes",
        "Payer",
        "Provider",
        "HEDIS",
        "Population Health",
        "Biostatistics",
        "Genomics",
        "Radiology",
        "Medical Imaging",
    ],
    "Insurance_Risk_Underwriting": [
        "Underwriting",
        "Actuarial Science",
        "Loss Ratio",
        "Claims Analysis",
        "Cat Modeling",
        "Catastrophe Modeling",
        "P&C",
        "Property & Casualty",
        "Life Insurance",
        "Pricing Model",
        "Reserving",
        "Policy Administration",
        "Regulatory Compliance",
        "Solvency II",
        "Capital Modeling",
        "Telematics",
    ],
    "General_Business_Analytical": [
        "A/B Testing",
        "Experimentation",
        "ROI Analysis",
        "LTV",
        "Customer Lifetime Value",
        "Churn Prediction",
        "Demand Forecasting",
        "Supply Chain",
        "Logistics",
        "Pricing Optimization",
        "Causal Inference",
        "Econometrics",
        "KPIs",
    ],
    "Soft_Skills_Misc": [
        "Communication Skills",
        "Teamwork",
        "Problem Solving",
        "Critical Thinking",
        "Steakholder Management",
        "Steakholder Engagement",
        "Priotization",
        "Self Starter",
        "Detail Oriented",
        "Start up",
    ],
    "Software_Development_Engineering": [
        "API",
        "REST",
        "OOP",
        "Data Structures",
        "Algorithims",
        "Microservices",
        "CI/CD",
        "Agile",
        "Scrum",
        "TDD",
        "Unit Testing",
        "Integration Testing",
        "System Design",
        "Cloud Computing",
        "JavaScript",
        "TypeScript",
        "C++",
        "C#",
        "Ruby",
        "PHP",
        "Go",
        "Swift",
        "Objective-C",
        "HTML",
        "CSS",
        "React",
        "Angular",
        "Vue.js",
        "Node.js",
        "Django",
        "Flask",
    ],
    "Productivity_Tools": [
        "Excel",
        "Tableau",
        "Power BI",
        "Looker",
        "Google Data Studio",
        "PowerPoint",
        "Word",
        "Visio",
        "Notion",
        "Confluence",
        "Slack",
        "Trello",
        "Asana",
        "Jira","Google Sheets","Google Docs","Micro Soft Office"
    ],
}

SKILL_PATTERNS = {
    skill: {
        "category": category,
        "pattern": re.compile(rf"\b{re.escape(skill)}\b", re.IGNORECASE),
    }
    for category, skills in SKILLS_DICTIONARY.items()
    for skill in skills
}

def extract_skills_dictionary(text: str) -> List[Tuple[str, str]]:
    matches: List[Tuple[str, str]] = []
    for skill, meta in SKILL_PATTERNS.items():
        if meta["pattern"].search(text):
            matches.append((skill.title(), meta["category"]))
    return matches

jobs_clean["skills_dictionary"] = jobs_clean["description_lower"].apply(extract_skills_dictionary)

jobs_skill_map_dictionary = (
    jobs_clean[["id", "job_label", "company", "skills_dictionary"]]
    .explode("skills_dictionary")
    .dropna(subset=["skills_dictionary"])
)

jobs_skill_map_dictionary["skill_name"] = jobs_skill_map_dictionary["skills_dictionary"].apply(lambda item: item[0])
jobs_skill_map_dictionary["skill_category"] = jobs_skill_map_dictionary["skills_dictionary"].apply(lambda item: item[1])

print(
    f"Dictionary extractor tagged {jobs_skill_map_dictionary['skill_name'].nunique():,} unique skills across {jobs_skill_map_dictionary['id'].nunique():,} jobs."
)


jobs_clean["skills"] = jobs_clean["skills_dictionary"].apply(
    lambda pairs: [skill for skill, _ in pairs] if isinstance(pairs, list) else []
)

jobs_skill_map = jobs_skill_map_dictionary.copy()
Dictionary extractor tagged 128 unique skills across 490 jobs.
InΒ [9]:
# Build the bipartite job-skill graph (jobs on one partition, skills on the other).
B = nx.Graph()

# Add job nodes so each posting is uniquely identified in the graph.
for _, row in jobs_clean.iterrows():
    job_node = f"job_{row['id']}"
    B.add_node(
        job_node,
        bipartite="job",
        label=row["job_label"],
        company=row["company"],
        location=row["location"],
    )

# Add skill nodes and connect them to the jobs that mention them.
for _, row in jobs_skill_map.iterrows():
    skill_node = f"skill_{row['skill_name'].lower().replace(' ', '_')}"
    B.add_node(
        skill_node,
        bipartite="skill",
        label=row["skill_name"],
        category=row["skill_category"],
    )
    job_node = f"job_{row['id']}"
    B.add_edge(job_node, skill_node)

print(f"Graph has {B.number_of_nodes():,} nodes and {B.number_of_edges():,} edges.")
Graph has 1,336 nodes and 1,973 edges.
InΒ [10]:
MAX_JOBS_FOR_BIPARTITE_PLOT = jobs_clean.shape[0]  # Adjust this number to limit the jobs shown in the plot.
sample_job_nodes = [
    f"job_{job_id}" for job_id in jobs_clean.head(MAX_JOBS_FOR_BIPARTITE_PLOT)["id"]
]
connected_skill_nodes = set()
for job in sample_job_nodes:
    connected_skill_nodes.update(B.neighbors(job))

subgraph_nodes = sample_job_nodes + list(connected_skill_nodes)
H_bipartite = B.subgraph(subgraph_nodes).copy()

if not H_bipartite:
    raise ValueError(
        "Bipartite subgraph is empty. Ensure jobs_skill_map is populated before plotting."
    )

job_nodes_sub = [
    n for n, d in H_bipartite.nodes(data=True) if d.get("bipartite") == "job"
]
skill_nodes_sub = [n for n in H_bipartite if n not in job_nodes_sub]

pos = {}
# Align jobs on the left (x=0) and skills on the right (x=1) for clarity.
pos.update((node, (0, idx)) for idx, node in enumerate(job_nodes_sub))
pos.update((node, (1, idx)) for idx, node in enumerate(skill_nodes_sub))

plt.figure(figsize=(12, 8))
nx.draw_networkx_nodes(
    H_bipartite,
    pos,
    nodelist=job_nodes_sub,
    node_color="#1f3ab4f0",
    node_size=200,
    label="Jobs",
)
nx.draw_networkx_nodes(
    H_bipartite,
    pos,
    nodelist=skill_nodes_sub,
    node_color="#ff520e",
    node_size=200,
    label="Skills",
)
nx.draw_networkx_edges(H_bipartite, pos, alpha=0.3)


plt.legend()
plt.title("Bipartite Job >> Skill Graph")
plt.axis("off")
plt.tight_layout()
plt.show()
No description has been provided for this image

Lets look at the most popular job titles in the dataset.

InΒ [11]:
title_counts = jobs_clean["title"].value_counts()

title_counts.head(20)
px.bar(
    title_counts.head(20),
    title="Top 10 titles by Number of Job Postings",
    labels={"index": "Job Title", "value": "Number of Job Postings"},
)
No description has been provided for this image

Now that we see the type of job titles in the dataset, we can move forward with building the job-skill bipartite graph. lets look at top skills pulled from the job descriptions next.

InΒ [12]:
skill_counts = (
    jobs_skill_map.groupby(["skill_name"])
    .agg(num_jobs=("id", "count"))
    .reset_index()
    .sort_values("num_jobs", ascending=False)
)

px.bar(
    skill_counts.head(30),
    x="skill_name",
    y="num_jobs",
    #color="skill_category",
    title="Top Skills Across Alfred Jobs",
)
No description has been provided for this image
InΒ [13]:
from networkx.algorithms import bipartite

# Separate the partitions so we can project skills onto themselves.
skill_nodes = {n for n, d in B.nodes(data=True) if d.get("bipartite") == "skill"}
job_nodes = set(B) - skill_nodes

# Skills are connected when they co-occur in at least one job posting (weighted by frequency).
skill_projection = bipartite.weighted_projected_graph(B, skill_nodes)
print(
    f"Skill projection has {skill_projection.number_of_nodes()} nodes / {skill_projection.number_of_edges()} edges"
)

# Degree centrality surfaces the skills that connect to the most neighbors in this co-occurrence graph.
centrality = nx.degree_centrality(skill_projection)
centrality_df = pd.DataFrame(
    {
        "skill_node": list(skill_projection.nodes()),
        "skill_name": [
            skill_projection.nodes[n]["label"] for n in skill_projection.nodes
        ],
        "category": [
            skill_projection.nodes[n]["category"] for n in skill_projection.nodes
        ],
        "centrality": [centrality[n] for n in skill_projection.nodes],
    }
).sort_values("centrality", ascending=False)

centrality_df.head(40)
Skill projection has 128 nodes / 1765 edges
Out[13]:
skill_node skill_name category centrality
111 skill_python Python Programming_Tools 0.811024
1 skill_sql Sql Programming_Tools 0.803150
109 skill_communication_skills Communication Skills Soft_Skills_Misc 0.661417
116 skill_aws Aws MLOps_DevOps_Cloud 0.598425
35 skill_machine_learning Machine Learning Machine_Learning_and_Artificial_Intelligence 0.590551
... ... ... ... ...
94 skill_powerpoint Powerpoint Productivity_Tools 0.299213
115 skill_hipaa Hipaa Healthcare_Pharma_Biotech 0.283465
95 skill_scikit-learn Scikit-Learn Machine_Learning_and_Artificial_Intelligence 0.267717
48 skill_kpis Kpis General_Business_Analytical 0.267717
86 skill_angular Angular Software_Development_Engineering 0.267717

40 rows ÃҀ” 4 columns

InΒ [14]:
centrality_df.head(10)
Out[14]:
skill_node skill_name category centrality
111 skill_python Python Programming_Tools 0.811024
1 skill_sql Sql Programming_Tools 0.803150
109 skill_communication_skills Communication Skills Soft_Skills_Misc 0.661417
116 skill_aws Aws MLOps_DevOps_Cloud 0.598425
35 skill_machine_learning Machine Learning Machine_Learning_and_Artificial_Intelligence 0.590551
66 skill_java Java Programming_Tools 0.543307
100 skill_agile Agile Software_Development_Engineering 0.527559
127 skill_go Go Software_Development_Engineering 0.511811
61 skill_tableau Tableau Productivity_Tools 0.503937
57 skill_life_insurance Life Insurance Insurance_Risk_Underwriting 0.503937

Lets see if we can visualize the top 40 nodes by their degree centrality.

InΒ [15]:
# Focus on the most central skills so the visualization stays interpretable.
top_nodes = centrality_df.head(50)["skill_node"]
H = skill_projection.subgraph(top_nodes).copy()
pos = nx.spring_layout(H, seed=42)  # deterministic layout for reproducibility
nodes = list(H.nodes())
x = [pos[n][0] for n in nodes]
y = [pos[n][1] for n in nodes]
size = [10 + 80 * centrality[n] for n in nodes]
categories = pd.Categorical([H.nodes[n]["category"] for n in nodes])

fig = go.Figure()

# edges as line segments
for u, v, data in H.edges(data=True):
    fig.add_trace(
        go.Scatter(
            x=[pos[u][0], pos[v][0]],
            y=[pos[u][1], pos[v][1]],
            mode="lines",
            line=dict(width=max(1, data.get("weight", 1) * 0.2), color="#cccccc"),
            hoverinfo="skip",
            showlegend=False,
        )
    )

# nodes
fig.add_trace(
    go.Scatter(
        x=x,
        y=y,
        mode="markers+text",
        text=[H.nodes[n]["label"] for n in nodes],
        textposition="bottom center",
        marker=dict(
            size=size,
            color=categories.codes,
            colorscale="Turbo",
            line=dict(width=1, color="#333"),
            
        ),
        hovertext=[
            f"{H.nodes[n]['label']}<br>Category: {H.nodes[n]['category']}<br>Centrality: {centrality[n]:.3f}"
            for n in nodes
        ],
        hoverinfo="text",
    )
)

fig.update_layout(
    title="Top Skill Co-occurrence Network",
    xaxis=dict(visible=False),
    yaxis=dict(visible=False),
    showlegend=False,
    margin=dict(l=20, r=20, t=60, b=20),
)
fig.show()
No description has been provided for this image

Persist Intermediate Tables (Optional)ΒΆ

  • You don't need to run this cell if you are starting from the CSV files above.

Export clean job and job/skill mapping tables so they can be fed into Gephi or downstream dashboards without re-querying the live database.

InΒ [16]:
# Persist tidy tables so collaborators can use them without re-running the database queries.
OUTPUT_DIR = PROJECT_ROOT / "artifacts"
OUTPUT_DIR.mkdir(exist_ok=True)

# Parquet for efficient reloads.
jobs_clean.to_parquet(OUTPUT_DIR / "jobs_clean.parquet", index=False)
jobs_skill_map.to_parquet(OUTPUT_DIR / "jobs_skill_map.parquet", index=False)

# CSV exports for easy sharing/export to other tools.
jobs_clean.to_csv(OUTPUT_DIR / "jobs_clean.csv", index=False)
jobs_skill_map.to_csv(OUTPUT_DIR / "jobs_skill_map.csv", index=False)

print(f"Saved artifacts to {OUTPUT_DIR}")
Saved artifacts to d:\Data620\FinalProject\artifacts

Now we can see if we detect any communities among the top skills using the Louvain method. The Louvain method is a popular algorithm for community detection in large networks. It optimizes modularity to find clusters of nodes that are more densely connected internally than with the rest of the network. It does this through a two-phase process: first, it assigns each node to its own community and iteratively merges communities to maximize modularity; second, it builds a new network where nodes represent the detected communities and repeats the process until no further modularity improvement is possible.

Skill "Island" CommunitiesΒΆ

To surface densely connected pockets of skills, we threshold the skill-projection edges by weight and run a modularity-based community detector. Tweak the WEIGHT_THRESHOLD or drop the filter entirely if you want to study the full graph.

InΒ [17]:
weights = [data["weight"] for _, _, data in skill_projection.edges(data=True)]
min_weight = min(weights)
max_weight = max(weights)
print(min_weight, max_weight)


WEIGHT_THRESHOLD = 3  # minimum shared-job weight required to keep an edge

filtered_edges = [
    (u, v)
    for u, v, data in skill_projection.edges(data=True)
    if data.get("weight", 0) >= WEIGHT_THRESHOLD
]

if not filtered_edges:
    raise ValueError(
        "No edges survived the weight threshold. Adjust WEIGHT_THRESHOLD or refine the NLP skill extraction heuristics."
    )

# edge_subgraph preserves node attributes (label/category) from the original skill projection
island_graph = skill_projection.edge_subgraph(filtered_edges).copy()

communities = list(nx_comm.greedy_modularity_communities(island_graph, weight="weight"))
community_map = {
    node: idx for idx, nodes in enumerate(communities, start=1) for node in nodes
}

community_summary = pd.DataFrame(
    [
        {
            "community_id": idx,
            "size": len(nodes),
            "sample_skills": ", ".join(
                sorted(island_graph.nodes[n]["label"] for n in list(nodes)[:5])
            ),
        }
        for idx, nodes in enumerate(communities, start=1)
    ]
).sort_values("size", ascending=False)

print(f"Detected {len(communities)} dense skill communities after thresholding.")
community_summary.head(20)
1 90
Detected 5 dense skill communities after thresholding.
Out[17]:
community_id size sample_skills
0 1 26 Communication Skills, Flask, Pandas, Sql, Teamwork
1 2 21 Classification, Deep Learning, Fraud Detection, Javascript, React
2 3 18 Go, Jira, Life Insurance, Microservices, Typescript
3 4 17 Airflow, Github, Hadoop, Kubernetes, Scala
4 5 2 Credit Card, Statistical Modeling
InΒ [18]:
# Visualize islands with Plotly for interactive inspection
H = island_graph  # already contains filtered nodes with attributes
pos = nx.spring_layout(H, seed=42, k=0.8)
nodes = list(H.nodes())
node_x = [pos[n][0] for n in nodes]
node_y = [pos[n][1] for n in nodes]
node_sizes = [12 + 30 * centrality.get(n, 0) for n in nodes]
communities_cat = pd.Categorical([community_map.get(n, 0) for n in nodes])

fig = go.Figure()
for u, v, data in H.edges(data=True):
    fig.add_trace(
        go.Scatter(
            x=[pos[u][0], pos[v][0]],
            y=[pos[u][1], pos[v][1]],
            mode="lines",
            line=dict(width=max(1, data.get("weight", 1) * 0.15), color="#bbbbbb"),
            hoverinfo="skip",
            showlegend=False,
        )
    )

fig.add_trace(
    go.Scatter(
        x=node_x,
        y=node_y,
        mode="markers+text",
        text=[H.nodes[n]["label"] for n in nodes],
        textposition="bottom center",
        marker=dict(
            size=node_sizes,
            color=communities_cat.codes,
            colorscale="Plasma",
            line=dict(width=1, color="#333"),
        ),
        hovertext=[
            f"{H.nodes[n]['label']}<br>Community: {community_map.get(n)}<br>Degree: {H.degree(n)}"
            for n in nodes
        ],
        hoverinfo="text",
        showlegend=False,
    )
)

fig.update_layout(
    title="Skill Islands (communities on weighted projection)",
    xaxis=dict(visible=False),
    yaxis=dict(visible=False),
    margin=dict(l=20, r=20, t=60, b=20),
)
fig.write_html("skill_islands_network.html", auto_open=True)
fig.write_json("skill_islands_visualization.json")

Dyanmic -Analysis NLP Skill ExtractionΒΆ

After completing the dictionary-driven network analysis above, we can optionally generate a dynamic skill inventory with spaCy to compare coverage. This cell works on a copy of jobs_clean so the baseline results stay intact.

We use spaCy's PhraseMatcher to identify skill phrases in job descriptions. We then rebuild the job-skill bipartite graph using the NLP-extracted skills and repeat the centrality and community detection analyses to compare against the baseline. We add some stopword skills to filter out common but uninformative terms.

InΒ [35]:
import itertools
from collections import Counter

import spacy
from spacy.cli import download as spacy_download
from spacy.lang.en.stop_words import STOP_WORDS
from sklearn.cluster import MiniBatchKMeans
from sklearn.feature_extraction.text import TfidfVectorizer
# make a copy of jobs_clean to avoid modifying the baseline skill extraction results
jobs_clean_nlp = jobs_clean.copy().reset_index(drop=True)

try:
    nlp = spacy.load("en_core_web_sm")
except OSError:
    spacy_download("en_core_web_sm")
    nlp = spacy.load("en_core_web_sm")
# Adjust spaCy's max_length if any job description exceeds the default limit.
max_desc_length = int(jobs_clean_nlp["description"].fillna("").str.len().max())
if max_desc_length:
    nlp.max_length = max(nlp.max_length, int(max_desc_length * 1.2))
# Define custom stopwords and filtering heuristics.
GENERIC_TERMS = {
    "ability",
    "abilities",
    "team",
    "teams",
    "company",
    "companies",
    "business",
    "clients",
    "client",
    "stakeholders",
    "stakeholder",
    "role",
    "roles",
    "position",
    "positions",
    "responsibilities",
    "responsibility",
    "experience",
    "experiences",
    "people",
    "solutions",
    "services",
    "product",
    "products",
    "organization",
    "organizations",
    "environment",
    "environments",
    "technology",
    "technologies",
    "year",
    "years",
    "degree",
    "degrees",
    "applicant",
    "applicants",
    "job",
    "jobs",
    "candidate",
    "candidates",
    "orientation","Data Engineer","Data Scientist"
}
# Words that are unlikely to be part of a skill phrase when they appear at the end.
BAD_ENDINGS = {"team", "teams", "company", "business", "clients", "stakeholders", "organization", "environment", "technology", "year", "job"}

# some custom stopwords relevant to job descriptions
CUSTOM_STOPWORDS = {
    "technology",
    "technologies",
    "year",
    "years",
    "degree",
    "degrees",
    "applicant",
    "applicants",
    "job",
    "jobs",
    "candidate",
    "candidates",
    "orientation",
    "team",
    "teams",
    "company",
    "business",
    "people","Title"
}
STOPWORD_SET = {word.lower() for word in STOP_WORDS}.union(CUSTOM_STOPWORDS)

DICTIONARY_SINGLE_SKILLS = {
    skill.lower()
    for skills in SKILLS_DICTIONARY.values()
    for skill in skills
}
SINGLE_SKILL_ALLOWLIST = DICTIONARY_SINGLE_SKILLS.union(
    {
        "data",
        "analytics",
        "analysis",
        "integration",
        "engineering",
        "modeling",
        "machine",
        "learning",
        "ai",
        "ml",
        "python",
        "sql",
        "spark",
        "aws",
        "azure",
        "gcp",
    }
)
ANCHOR_TERMS = {
    "data",
    "analytics",
    "analysis",
    "integration",
    "science",
    "engineering",
    "machine",
    "learning",
    "intelligence",
    "modeling",
    "ai",
    "ml",
    "cloud",
    "platform",
    "product",
    "pipeline",
    "governance",
    "visualization",
    "automation",
    "insight",
    "insights",
}
# Normalization function to clean and standardize phrases.
def normalize_phrase(text: str) -> str:
    text = text.lower()
    text = re.sub(r"[^a-z0-9+#/&\- ]+", " ", text)
    text = re.sub(r"\s+", " ", text).strip()
    return text
# Token filtering functions.
def token_is_candidate(token) -> bool:
    if token.is_stop or token.is_punct:
        return False
    lemma = token.lemma_.lower()
    if lemma in STOPWORD_SET:
        return False
    return token.pos_ in {"PROPN", "NOUN", "ADJ"}

# Heuristic to allow certain single-token skills.
def allow_single_token(token, norm: str) -> bool:
    lemma = token.lemma_.lower()
    text = token.text
    if lemma in SINGLE_SKILL_ALLOWLIST:
        return True
    if text.isupper() and len(text) <= 4:
        return True
    if any(char.isdigit() for char in text):
        return True
    return False

# Heuristic to filter out non-skill-like phrases.
def is_skill_like(text: str) -> bool:
    if not text or len(text) < 3:
        return False
    words = text.split()
    if len(words) > 4:
        return False
    if all(word in GENERIC_TERMS for word in words):
        return False
    if words[-1] in BAD_ENDINGS:
        return False
    if text in STOPWORD_SET:
        return False
    return True

# Main extraction function for a spaCy doc.
def extract_skill_candidates(doc):
    phrases = set()
    candidate_tokens = [token for token in doc if token_is_candidate(token)]

    for chunk in doc.noun_chunks:
        norm = normalize_phrase(chunk.text)
        if not is_skill_like(norm):
            continue
        words = norm.split()
        if len(words) == 1:
            token = chunk.root
            if allow_single_token(token, norm):
                phrases.add(norm)
        else:
            if any(word in ANCHOR_TERMS for word in words):
                phrases.add(norm)
# Single-token candidates
    for token in candidate_tokens:
        norm = normalize_phrase(token.text)
        if is_skill_like(norm) and allow_single_token(token, norm):
            phrases.add(norm)
# Multi-token candidates (2-3 grams)
    max_window = 3
    for window in range(2, max_window + 1):
        for i in range(len(candidate_tokens) - window + 1):
            phrase_tokens = candidate_tokens[i : i + window]
            lemma_window = [token.lemma_.lower() for token in phrase_tokens]
            if not any(lemma in ANCHOR_TERMS for lemma in lemma_window):
                continue
            phrase = " ".join(token.text for token in phrase_tokens)
            norm = normalize_phrase(phrase)
            if is_skill_like(norm):
                phrases.add(norm)

    return sorted(phrases)

# Extract skill candidates from each job description.
descriptions = jobs_clean_nlp["description"].fillna("")
skill_candidates = []
for doc in tqdm(
    nlp.pipe(descriptions.tolist(), batch_size=32),
    total=len(descriptions),
    desc="Extracting skill candidates",
):
    skill_candidates.append(extract_skill_candidates(doc))

jobs_clean_nlp["skill_candidates_raw"] = skill_candidates

candidate_counts = Counter(itertools.chain.from_iterable(skill_candidates))
if not candidate_counts:
    raise ValueError("spaCy-based skill extraction did not produce any candidates.")
# Apply global frequency filtering to build the skill vocabulary.
MIN_GLOBAL_FREQ = 5
MAX_GLOBAL_SHARE = 0.65
max_allowed = max(1, int(MAX_GLOBAL_SHARE * len(jobs_clean_nlp)))
skill_vocabulary = {
    phrase
    for phrase, count in candidate_counts.items()
    if count >= MIN_GLOBAL_FREQ and count <= max_allowed
    and phrase not in STOPWORD_SET
}

# Finalize skill list for each job by filtering to the vocabulary and deduplicating.
def finalize_skills(candidate_list):
    filtered = [phrase for phrase in candidate_list if phrase in skill_vocabulary]
    return sorted(set(filtered))

# Apply finalization to each job's candidates.
jobs_clean_nlp["skills_nlp"] = [
    [phrase.title() for phrase in finalize_skills(candidates)]
    for candidates in skill_candidates
]
# Build the job-skill mapping DataFrame from NLP-extracted skills.
jobs_skill_map_dynamic = (
    jobs_clean_nlp[["id", "job_label", "company", "skills_nlp"]]
    .explode("skills_nlp")
    .dropna(subset=["skills_nlp"])
    .rename(columns={"skills_nlp": "skill_name"})
)
# Ensure we have some skills after filtering.
if jobs_skill_map_dynamic.empty:
    raise ValueError("No NLP-derived skills survived the filtering heuristics.")
# Prepare for clustering by normalizing skill names.
jobs_skill_map_dynamic["skill_name"] = jobs_skill_map_dynamic["skill_name"].str.strip()
 
# Cluster similar skills using TF-IDF + MiniBatchKMeans.
unique_skills = jobs_skill_map_dynamic["skill_name"].str.lower().unique().tolist()
skill_category_lookup = {}
if unique_skills:
    vectorizer = TfidfVectorizer(ngram_range=(1, 2), stop_words="english")
    skill_matrix = vectorizer.fit_transform(unique_skills)
    n_clusters = min(8, len(unique_skills))
    if n_clusters == 1:
        cluster_labels = [0]
    else:
        clusterer = MiniBatchKMeans(
            n_clusters=n_clusters,
            random_state=42,
            batch_size=min(256, len(unique_skills)),
            max_iter=200,
        )
        cluster_labels = clusterer.fit_predict(skill_matrix)
    skill_category_lookup = {
        skill: f"NLP Cluster {label + 1}"
        for skill, label in zip(unique_skills, cluster_labels)
    }
# Map clustered categories back to the job-skill mapping.
jobs_skill_map_dynamic["skill_category"] = jobs_skill_map_dynamic["skill_name"].str.lower().map(
    skill_category_lookup
)
jobs_skill_map_dynamic["skill_category"] = jobs_skill_map_dynamic["skill_category"].fillna("NLP Cluster 1")
jobs_skill_map_dynamic["skill_strength"] = jobs_skill_map_dynamic["skill_name"].str.lower().map(candidate_counts)
 
# Summary of NLP-extracted skills.
print(
    f"NLP extractor tagged {jobs_skill_map_dynamic['skill_name'].nunique():,} unique skills across {jobs_skill_map_dynamic['id'].nunique():,} jobs."
)
Extracting skill candidates:   0%|          | 0/1208 [00:00<?, ?it/s]
NLP extractor tagged 965 unique skills across 1,026 jobs.
InΒ [36]:
skill_extraction_summary = pd.DataFrame(
    [
        {
            "method": "Dictionary",
            "unique_skills": jobs_skill_map_dictionary["skill_name"].nunique(),
            "jobs_with_skills": jobs_skill_map_dictionary["id"].nunique(),
        },
        {
            "method": "NLP",
            "unique_skills": jobs_skill_map_dynamic["skill_name"].nunique(),
            "jobs_with_skills": jobs_skill_map_dynamic["id"].nunique(),
        },
    ]
)
skill_extraction_summary
Out[36]:
method unique_skills jobs_with_skills
0 Dictionary 128 490
1 NLP 965 1026
InΒ [37]:
skill_frequency = (
    jobs_skill_map_dynamic["skill_name"].value_counts().rename_axis("skill_name").reset_index(name="mention_count")
)
skill_frequency
Out[37]:
skill_name mention_count
0 Data 627
1 Engineering 389
2 Data Engineer 359
3 Analysis 259
4 Data Engineering 202
... ... ...
960 Ai Machine 5
961 Emr 5
962 Confluence 5
963 System Integration 5
964 Product Backlog 5

965 rows Γƒβ€” 2 columns

InΒ [38]:
from spacy.lang.en.stop_words import STOP_WORDS


custom_stopwords = {
    "technology",
    "technologies",
    "year",
    "years",
    "degree",
    "degrees",
    "applicant",
    "applicants",
    "job",
    "jobs",
    "candidate",
    "candidates",
    "orientation",
    "team",
    "teams",
    "company",
    "business",
    "people",
    "Excellent",
    "Which",
    "Must",
    "Will",
    "Work",
    "Using",
    "knowledge",
    "skills",
    "high",
    "engineer",
    "Science",
    "an employee",
    "aware",
    "Service",
    "Services",
    "Based",
    "Level",
    "Levels",
    "Various",
    "including",
    "related",
    "field",
    "fields",
    "areas",
    "area",
    "Entreprise",
    "enterprises",
    "The role",
    "ability",
    "abilities",
    "stakeholder",
    "stakeholders",
    "data",
    "organization",
    "organizations",
    "environment",
    "environments",
    "range",
    "the range",
    "life",
    "lives",
    "the life",
    "datum",
    "platform",
    "engineering",
    "new",
    "design",
    "analysts",
    "analyst",
    "functional",
    "industry",
    "problem",
    "tool",
    "decision",
    "application",
    "dtrong",
    "day",
    "cross",
    "news",
    "development",
    "this role",
    "solution",
    "time",
    "world",
    "system",
    "systems",
    "opportunity",
    "opportunities",
    "technical",
    "skill",
    "software",
    "quality",
    "customers",
    "customer",
    "location",
    "locations",
    "process",
    "impact",
    "impacts",
    "member",
    "description",
    "project",
    "strong",
    "benefit",
    "analysis",
    "support",
    "enterprise",
    "enterprises",
    "management",
    "employee","information","informations","end","lead","employment",
    "work","works","initiative","requirement","qualification","qualifications",
    "complex","complexity","collaborate","collaborates","collaboration", "disability","diversity", "innovation","innovative","infrastructure","expertise",
    "performance","performances",
    "methodology","methodologies","method","methods","status","euality","equal","employer","employers","global","analytics","analytic",
    "individual","Office","401 K","U S","USD"
    
}
custom_stopwords = {word.lower() for word in custom_stopwords}
stopword_flags = skill_frequency.assign(
    skill_lower=lambda df: df["skill_name"].str.lower(),
    spacy_stop=lambda df: df["skill_lower"].isin(STOP_WORDS),
    custom_stop=lambda df: df["skill_lower"].isin(custom_stopwords),
)
stopword_flags[["skill_name", "mention_count", "spacy_stop", "custom_stop"]]
stopword_flags_filtered = stopword_flags[
    ~(stopword_flags["spacy_stop"] | stopword_flags["custom_stop"])
].reset_index(drop=True)

stopword_flags_filtered.head(20)
Out[38]:
skill_name mention_count skill_lower spacy_stop custom_stop
0 Data Engineer 359 data engineer False False
1 Data Engineering 202 data engineering False False
2 Azure 159 azure False False
3 Integration 148 integration False False
4 Python 135 python False False
5 Sql 135 sql False False
6 Computer Science 123 computer science False False
7 Azure Data 115 azure data False False
8 Data Pipelines 115 data pipelines False False
9 Data Engineer Impact 114 data engineer impact False False
10 Data Platforms 114 data platforms False False
11 Azure Data Engineer 112 azure data engineer False False
12 Data Integration 111 data integration False False
13 Product Owners 107 product owners False False
14 Engineering Analytics 105 engineering analytics False False
15 Data Warehousing 104 data warehousing False False
16 Data Engineering Analytics 103 data engineering analytics False False
17 Member Data 102 member data False False
18 Analysis Development 101 analysis development False False
19 Data Warehousing Cloud 101 data warehousing cloud False False
InΒ [39]:
# Filter NLP-derived skills to drop stop words before ranking
stopword_set = {word.lower() for word in STOP_WORDS}.union(custom_stopwords)
jobs_skill_map_dynamic = jobs_skill_map_dynamic[
    ~jobs_skill_map_dynamic["skill_name"].str.lower().isin(stopword_set)
].copy()
skill_frequency = (
    jobs_skill_map_dynamic["skill_name"].value_counts()
    .rename_axis("skill_name")
    .reset_index(name="mention_count")
)
skill_frequency.head(50)
Out[39]:
skill_name mention_count
0 Data Engineer 359
1 Data Engineering 202
2 Azure 159
3 Integration 148
4 Sql 135
... ... ...
45 Learning 70
46 Machine 70
47 Artificial Intelligence 65
48 Etl 58
49 Java 58

50 rows Γƒβ€” 2 columns

InΒ [40]:
skill_freq_ranked = (
    skill_frequency.sort_values("mention_count", ascending=False)
    .reset_index(drop=True)
    .reset_index()
    .rename(columns={"index": "skill_rank"})
)
fig = px.bar(
    skill_freq_ranked,
    x="skill_rank",
    y="mention_count",
    title="Skill Mention Frequency (NLP Extraction)",
    labels={"skill_rank": "Skill Rank", "mention_count": "Mentions"},
)

fig.update_traces(
    hovertemplate="Rank %{x}<br>Skill: %{customdata[0]}<br>Mentions: %{y}",
    customdata=skill_freq_ranked[["skill_name"]].values,
)
fig.show()
No description has been provided for this image

NLP Skill GraphsΒΆ

Reconstruct the job-skill network using the NLP-derived skills and compare it against the dictionary-based graph.

InΒ [Β ]:
# Build the NLP-based bipartite graph and projected skill network
try:
    jobs_skill_map_dynamic
    jobs_clean_nlp
except NameError as exc:  # pragma: no cover
    raise RuntimeError("Run the NLP extraction cells before this section.") from exc
# Build the bipartite job-skill graph (jobs on one partition, skills on the other).
B_nlp = nx.Graph()
for _, row in jobs_clean_nlp.iterrows():
    job_node = f"job_{row['id']}"
    B_nlp.add_node(
        job_node,
        bipartite="job",
        label=row.get("job_label", row.get("title", "Job")),
        company=row.get("company", "Unknown"),
        location=row.get("location", "Unknown"),
    )
# Add skill nodes and connect them to the jobs that mention them.
for _, row in jobs_skill_map_dynamic.iterrows():
    job_node = f"job_{row['id']}"
    skill_node = f"skill_{row['skill_name'].lower().replace(' ', '_')}"
    B_nlp.add_node(
        skill_node,
        bipartite="skill",
        label=row["skill_name"],
        category=row.get("skill_category", "NLP"),
    )
    if B_nlp.has_node(job_node):
        B_nlp.add_edge(job_node, skill_node)
# Ensure the graph is not empty.
if B_nlp.number_of_nodes() == 0:
    raise ValueError("NLP bipartite graph is empty. Ensure jobs_skill_map_dynamic is populated.")
# Summarize the NLP bipartite graph.
skill_nodes_nlp = {n for n, d in B_nlp.nodes(data=True) if d.get("bipartite") == "skill"}
job_nodes_nlp = set(B_nlp) - skill_nodes_nlp
print(f"B_nlp contains {len(job_nodes_nlp):,} job nodes, {len(skill_nodes_nlp):,} skill nodes, and {B_nlp.number_of_edges():,} edges.")
# Project skills onto themselves based on co-occurrence in job postings.
skill_projection_nlp = bipartite.weighted_projected_graph(B_nlp, skill_nodes_nlp)
print(
    f"NLP skill projection has {skill_projection_nlp.number_of_nodes():,} nodes / {skill_projection_nlp.number_of_edges():,} edges"
)
# Compute degree centrality on the NLP skill projection.
centrality_nlp = nx.degree_centrality(skill_projection_nlp)
centrality_df_nlp = pd.DataFrame(
    {
        "skill_node": list(skill_projection_nlp.nodes()),
        "skill_name": [skill_projection_nlp.nodes[n]["label"] for n in skill_projection_nlp.nodes],
        "category": [skill_projection_nlp.nodes[n].get("category", "NLP") for n in skill_projection_nlp.nodes],
        "centrality": [centrality_nlp[n] for n in skill_projection_nlp.nodes],
    }
).sort_values("centrality", ascending=False)

centrality_df_nlp.head(20)
B_nlp contains 1,208 job nodes, 958 skill nodes, and 17,652 edges.
NLP skill projection has 958 nodes / 47,328 edges
Out[Β ]:
skill_node skill_name category centrality
460 skill_python Python NLP Cluster 4 0.613375
735 skill_sql Sql NLP Cluster 4 0.589342
429 skill_data_engineer Data Engineer NLP Cluster 4 0.584117
457 skill_computer_science Computer Science NLP Cluster 4 0.569488
647 skill_learning Learning NLP Cluster 4 0.538140
472 skill_aws Aws NLP Cluster 4 0.518286
58 skill_data_pipelines Data Pipelines NLP Cluster 8 0.504702
449 skill_data_engineering Data Engineering NLP Cluster 4 0.495298
60 skill_azure Azure NLP Cluster 4 0.462905
244 skill_etl Etl NLP Cluster 4 0.455590
951 skill_modeling Modeling NLP Cluster 4 0.432602
764 skill_integration Integration NLP Cluster 4 0.425287
777 skill_machine Machine NLP Cluster 4 0.422153
601 skill_bachelor_computer_science Bachelor Computer Science NLP Cluster 4 0.409613
849 skill_experience_data Experience Data NLP Cluster 3 0.385580
272 skill_engineering_related Engineering Related NLP Cluster 4 0.371996
827 skill_java Java NLP Cluster 4 0.366771
295 skill_spark Spark NLP Cluster 4 0.365726
311 skill_machine_learning Machine Learning NLP Cluster 4 0.364681
668 skill_data_analysis Data Analysis NLP Cluster 4 0.364681
InΒ [42]:
def summarize_projection(graph, skill_nodes_count, job_nodes_count, projection):
    degree_values = [deg for _, deg in projection.degree()]
    avg_degree = (sum(degree_values) / len(degree_values)) if degree_values else 0
    return {
        "jobs": job_nodes_count,
        "skills": skill_nodes_count,
        "job_skill_edges": graph.number_of_edges(),
        "projection_nodes": projection.number_of_nodes(),
        "projection_edges": projection.number_of_edges(),
        "projection_avg_degree": avg_degree,
    }

summary_df = pd.DataFrame(
    [
        {"pipeline": "Dictionary", **summarize_projection(B, len(skill_nodes), len(job_nodes), skill_projection)},
        {"pipeline": "NLP", **summarize_projection(B_nlp, len(skill_nodes_nlp), len(job_nodes_nlp), skill_projection_nlp)},
    ]
)
summary_df
Out[42]:
pipeline jobs skills job_skill_edges projection_nodes projection_edges projection_avg_degree
0 Dictionary 1208 128 1973 128 1765 27.578125
1 NLP 1208 958 17652 958 47328 98.805846
InΒ [43]:
top_k = 20
top_centrality = pd.concat(
    [
        centrality_df.head(top_k).assign(pipeline="Dictionary"),
        centrality_df_nlp.head(top_k).assign(pipeline="NLP"),
    ],
    ignore_index=True,
)
top_centrality
Out[43]:
skill_node skill_name category centrality pipeline
0 skill_python Python Programming_Tools 0.811024 Dictionary
1 skill_sql Sql Programming_Tools 0.803150 Dictionary
2 skill_communication_skills Communication Skills Soft_Skills_Misc 0.661417 Dictionary
3 skill_aws Aws MLOps_DevOps_Cloud 0.598425 Dictionary
4 skill_machine_learning Machine Learning Machine_Learning_and_Artificial_Intelligence 0.590551 Dictionary
... ... ... ... ... ...
35 skill_engineering_related Engineering Related NLP Cluster 4 0.371996 NLP
36 skill_java Java NLP Cluster 4 0.366771 NLP
37 skill_spark Spark NLP Cluster 4 0.365726 NLP
38 skill_machine_learning Machine Learning NLP Cluster 4 0.364681 NLP
39 skill_data_analysis Data Analysis NLP Cluster 4 0.364681 NLP

40 rows Γƒβ€” 5 columns

InΒ [44]:
MAX_JOBS_FOR_NLP_BIPARTITE_PLOT = jobs_clean_nlp.shape[0]
sample_job_nodes_nlp = [
    f"job_{job_id}" for job_id in jobs_clean_nlp.head(MAX_JOBS_FOR_NLP_BIPARTITE_PLOT)["id"]
]
connected_skill_nodes_nlp = set()
for job in sample_job_nodes_nlp:
    if B_nlp.has_node(job):
        connected_skill_nodes_nlp.update(B_nlp.neighbors(job))

subgraph_nodes_nlp = sample_job_nodes_nlp + list(connected_skill_nodes_nlp)
H_bipartite_nlp = B_nlp.subgraph(subgraph_nodes_nlp).copy()

if not H_bipartite_nlp:
    raise ValueError("NLP bipartite subgraph is empty. Ensure jobs_skill_map_dynamic is populated before plotting.")

job_nodes_sub_nlp = [n for n, d in H_bipartite_nlp.nodes(data=True) if d.get("bipartite") == "job"]
skill_nodes_sub_nlp = [n for n in H_bipartite_nlp if n not in job_nodes_sub_nlp]

pos_nlp = {}
pos_nlp.update((node, (0, idx)) for idx, node in enumerate(job_nodes_sub_nlp))
pos_nlp.update((node, (1, idx)) for idx, node in enumerate(skill_nodes_sub_nlp))

plt.figure(figsize=(12, 8))
nx.draw_networkx_nodes(
    H_bipartite_nlp,
    pos_nlp,
    nodelist=job_nodes_sub_nlp,
    node_color="#1f3ab4f0",
    node_size=200,
    label="Jobs",
)
nx.draw_networkx_nodes(
    H_bipartite_nlp,
    pos_nlp,
    nodelist=skill_nodes_sub_nlp,
    node_color="#ff520e",
    node_size=200,
    label="Skills",
)
nx.draw_networkx_edges(H_bipartite_nlp, pos_nlp, alpha=0.3)

plt.legend()
plt.title("NLP Bipartite Job >> Skill Graph")
plt.axis("off")
plt.tight_layout()
plt.show()
No description has been provided for this image
InΒ [45]:
top_nodes_nlp = centrality_df_nlp.head(100)["skill_node"]
H_nlp_proj = skill_projection_nlp.subgraph(top_nodes_nlp).copy()
pos_proj_nlp = nx.spring_layout(H_nlp_proj, seed=42)
nodes = list(H_nlp_proj.nodes())
x = [pos_proj_nlp[n][0] for n in nodes]
y = [pos_proj_nlp[n][1] for n in nodes]
size = [10 + 80 * centrality_nlp.get(n, 0) for n in nodes]

fig = go.Figure()
for u, v, data in H_nlp_proj.edges(data=True):
    fig.add_trace(
        go.Scatter(
            x=[pos_proj_nlp[u][0], pos_proj_nlp[v][0]],
            y=[pos_proj_nlp[u][1], pos_proj_nlp[v][1]],
            mode="lines",
            line=dict(width=max(1, data.get("weight", 1) * 0.2), color="#cccccc"),
            hoverinfo="skip",
            showlegend=False,
        )
    )

fig.add_trace(
    go.Scatter(
        x=x,
        y=y,
        mode="markers+text",
        text=[H_nlp_proj.nodes[n]["label"] for n in nodes],
        textposition="bottom center",
        marker=dict(size=size, color="#9467bd", line=dict(width=1, color="#333")),
        hovertext=[
            f"{H_nlp_proj.nodes[n]['label']}<br>Degree: {H_nlp_proj.degree(n)}"
            for n in nodes
        ],
        hoverinfo="text",
    )
)

fig.update_layout(
    title="NLP Skill Projection (Top 100 nodes)",
    xaxis=dict(visible=False),
    yaxis=dict(visible=False),
    margin=dict(l=20, r=20, t=60, b=20),
)
fig.show()
No description has been provided for this image

NLP Skill IslandsΒΆ

Apply the same weight-thresholded community detection on the NLP skill projection.

InΒ [46]:
weights_nlp = [data["weight"] for _, _, data in skill_projection_nlp.edges(data=True)]
if not weights_nlp:
    raise ValueError("NLP skill projection has no edges to analyze.")
print(min(weights_nlp), max(weights_nlp))

WEIGHT_THRESHOLD_NLP = 20
filtered_edges_nlp = [
    (u, v)
    for u, v, data in skill_projection_nlp.edges(data=True)
    if data.get("weight", 0) >= WEIGHT_THRESHOLD_NLP
]

if not filtered_edges_nlp:
    raise ValueError("No NLP edges survived the weight threshold. Adjust WEIGHT_THRESHOLD_NLP or revisit the skill extraction.")

island_graph_nlp = skill_projection_nlp.edge_subgraph(filtered_edges_nlp).copy()

communities_nlp = list(nx_comm.greedy_modularity_communities(island_graph_nlp, weight="weight"))
community_map_nlp = {
    node: idx for idx, nodes in enumerate(communities_nlp, start=1) for node in nodes
}

community_summary_nlp = pd.DataFrame(
    [
        {
            "community_id": idx,
            "size": len(nodes),
            "sample_skills": ", ".join(
                sorted(island_graph_nlp.nodes[n]["label"] for n in list(nodes)[:5])
            ),
        }
        for idx, nodes in enumerate(communities_nlp, start=1)
    ]
).sort_values("size", ascending=False)

print(f"Detected {len(communities_nlp)} NLP skill communities after thresholding.")
community_summary_nlp.head(20)
1 166
Detected 10 NLP skill communities after thresholding.
Out[46]:
community_id size sample_skills
0 1 62 Chief Artificial Intelligence, Chief Data Officer, Etl, Experience Data, Sql
1 2 50 Azure, Data Engineering, End Data Integration, Enterprise Data Platforms, Warehousing Cloud
2 3 33 Data Engineer New, Data Quality, Insights Ready Decisions, Main Focus Data, Visualization Servic...
3 4 17 Data Platform, Engineering Challenges, Platform Infrastructure, Platform Wide Design, Vision Data
4 5 15 Advances Ai Data, Ai Data Computational, Future Science Access, Roche Advances Ai, Sciences Drug
5 6 15 Code Ai Idea, Figma Platform, Figma Platform Ideas, Figma S Platform, Platform Ideas
6 7 13 Cloud Solutions, Cloud Solutions Oracle, Oci, Products Industries, World Leader Cloud
7 8 3 Hadoop, Title Data, Title Data Engineer
8 9 3 Cnbc, Nbc, Usa
9 10 3 A Senior Data Engineer, Senior Data, Senior Data Engineer
InΒ [47]:
H = island_graph_nlp
pos = nx.spring_layout(H, seed=42, k=0.8)
nodes = list(H.nodes())
node_x = [pos[n][0] for n in nodes]
node_y = [pos[n][1] for n in nodes]
node_sizes = [12 + 30 * centrality_nlp.get(n, 0) for n in nodes]
communities_cat = pd.Categorical([community_map_nlp.get(n, 0) for n in nodes])

fig = go.Figure()
for u, v, data in H.edges(data=True):
    fig.add_trace(
        go.Scatter(
            x=[pos[u][0], pos[v][0]],
            y=[pos[u][1], pos[v][1]],
            mode="lines",
            line=dict(width=max(1, data.get("weight", 1) * 0.15), color="#bbbbbb"),
            hoverinfo="skip",
            showlegend=False,
        )
    )

fig.add_trace(
    go.Scatter(
        x=node_x,
        y=node_y,
        mode="markers+text",
        #text=[H.nodes[n]["label"] for n in nodes],
        #textposition="bottom center",
        marker=dict(
            size=node_sizes,
            color=communities_cat.codes,
            colorscale="Plasma",
            line=dict(width=1, color="#333"),
        ),
        hovertext=[
            f"{H.nodes[n]['label']}<br>Community: {community_map_nlp.get(n)}<br>Degree: {H.degree(n)}"
            for n in nodes
        ],
        hoverinfo="text",
        showlegend=False,
    )
)

fig.update_layout(
    title="NLP Skill Islands (communities on weighted projection)",
    xaxis=dict(visible=False),
    yaxis=dict(visible=False),
    margin=dict(l=20, r=20, t=60, b=20),
)
fig.show()
No description has been provided for this image
InΒ [53]:
print(top_centrality.head(5).to_markdown())
print(top_centrality.iloc[20:24].to_markdown())
|    | skill_node                 | skill_name           | category                                     |   centrality | pipeline   |
|---:|:---------------------------|:---------------------|:---------------------------------------------|-------------:|:-----------|
|  0 | skill_python               | Python               | Programming_Tools                            |     0.811024 | Dictionary |
|  1 | skill_sql                  | Sql                  | Programming_Tools                            |     0.80315  | Dictionary |
|  2 | skill_communication_skills | Communication Skills | Soft_Skills_Misc                             |     0.661417 | Dictionary |
|  3 | skill_aws                  | Aws                  | MLOps_DevOps_Cloud                           |     0.598425 | Dictionary |
|  4 | skill_machine_learning     | Machine Learning     | Machine_Learning_and_Artificial_Intelligence |     0.590551 | Dictionary |
|    | skill_node             | skill_name       | category      |   centrality | pipeline   |
|---:|:-----------------------|:-----------------|:--------------|-------------:|:-----------|
| 20 | skill_python           | Python           | NLP Cluster 4 |     0.613375 | NLP        |
| 21 | skill_sql              | Sql              | NLP Cluster 4 |     0.589342 | NLP        |
| 22 | skill_data_engineer    | Data Engineer    | NLP Cluster 4 |     0.584117 | NLP        |
| 23 | skill_computer_science | Computer Science | NLP Cluster 4 |     0.569488 | NLP        |
InΒ [56]:
fig = px.bar(
    top_centrality,
    x="skill_name",
    y="centrality",
    color="pipeline",
    barmode="group",
    title="Top Skill Centrality by Pipeline",
    labels={"skill_name": "Skill", "centrality": "Degree Centrality"},
)
fig.update_layout(
    xaxis_tickangle=45,
    xaxis_tickmode="linear",
    xaxis_categoryorder="total descending",
    yaxis=dict(title="Degree Centrality", rangemode="tozero"),
    legend_title="Pipeline",
)
fig.show()
No description has been provided for this image

Project SummaryΒΆ

We evaluated Alfred job postings through dual job to skill pipelines, building bipartite graphs, Louvain-based skill islands, and centrality comparisons to pinpoint core skill cohorts job seekers should prioritize for targeted search strategies.

We can see that the NLP-driven skill extraction uncovers broader skills and denser connections, it is also more prone to noise without careful phrase curation. The dictionary-based approach offers precision but may miss emerging or nuanced skills. Combining both methods could yield a comprehensive skill map for job seekers.

The differences in community structures highlight how skill relationships vary based on extraction methods, informing Alfred's job matching algorithms and user guidance. While the job descriptions in this dataset are specific to Alfred's postings, the methodologies applied here can be generalized to other job boards or industries with appropriate adjustments to skill dictionaries and NLP models. Improvement can include implementing a small llm to help rationalize skill extractions and indeed in industry it appears several job boards are implementing some of this functionality already. These include ai powered job description analyzers that suggest skills to add to ones resume based on the job description. But even there, there is significant room for improvement as many of these tools still miss key skills and naunces based of minor textutal differences and human language complexity.

The Projects was succeful in demostrating the power of graphs analysis to help job seekers, for instance I realize that focusing on shapening my SQL skills and python along with big data tools like spark and haddop will make my resume appeal more to recruiters in the data engineering field.